Introduction to Automated Text Analysis

Basics and Bag-Of-Words

Author

Carsten Schwemmer

Published

26.07.2023

Before we start

Please make sure your R and RStudio environment is up to date, as we will make use of recent R features (such as the native pipe operator |>) and recent documentation packages (Quarto).

Access to material

Packages

These are the new packages that we will need today:

Code
install.packages(c( 'quarto', 'stm', 'stminsights', 'textdata',  
                   'quanteda', 'quanteda.textstats', 
                   'quanteda.textmodels', 'quanteda.textplots',
                   'tidymodels'))

Literature recommendation

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.

Basics of quantitative text analysis

Character encoding

Computers store text in the form of numbers. For each character, e.g. a, there is a corresponding sequence of numbers used for character encoding.

https://www.asciitable.com/

Character encoding

Today, many different standards for text encoding exist (e.g. ASCII, UTF-8, UTF-16, Latin-1, ..). If you read textual data and declare the wrong encoding, some characters will not be parsed correctly:

Code
all_good <- readLines('data/encoding_issues_utf8.txt',
                     encoding = 'UTF-8')
all_good
[1] "NLP rocks"                 "Éncôdíng_cäuses_headaçhe$"
Code
suffering <- readLines('data/encoding_issues_utf8.txt',
                       encoding = 'latin1')
suffering
[1] "NLP rocks"                      "Ã‰ncÃ´dÃ­ng_cÃ¤uses_headaÃ§he$"

Character encoding - advice

  • store your data in one consistent encoding to avoid headaches
  • UTF-8 is commonly used and the default encoding for many R packages (e.g. tidyverse)
  • if you don’t know the encoding of a text, try several encodings and qualitatively inspect results
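If data arrives in a different encoding, it can be converted once and then stored consistently as UTF-8. A minimal sketch using base R's iconv():

Code
```r
# convert a string between encodings with base R iconv()
x_latin1 <- iconv("Éncôdíng", from = "UTF-8", to = "latin1")
Encoding(x_latin1) # declared encoding is now "latin1"
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")
x_utf8 == "Éncôdíng" # comparison succeeds after converting back
```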

Text is complex

  • “Time flies like an arrow, fruit flies like a banana”
  • “Make peace, not war; make war, not peace”
  • quantitative models (need to) simplify textual data and will fail to capture some complexity

How can we deal with this complexity?

“We destroy language to turn it into data” - Ken Benoit, IC2S2 2019

Tokens

“Make love, not war.” -> c('make', 'love', 'not', 'war')

  • tokens define semantic elements of strings; important for many applications of text analysis (e.g. counting)

  • predominantly separated by white space and punctuation marks

  • converting text to tokens raises several questions:

    • what unit of analysis? words? sentences?
    • which rules (algorithms) shall be used to define tokens?
    • which tokens are not useful for the analysis?
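Before handing this over to a full tokenizer, the core idea can be sketched in a few lines of base R. This naive version splits on whitespace and punctuation and is for illustration only, not a replacement for quanteda's tokens():

Code
```r
# naive tokenizer: lowercase, then split on whitespace/punctuation
naive_tokens <- function(text) {
  unlist(strsplit(tolower(text), "[[:space:][:punct:]]+"))
}
naive_tokens("Make love, not war.")
# [1] "make" "love" "not"  "war"
```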

From tokens to “bag of words” (bow)

  • disassembling texts into tokens is the foundation for the bag of words model (bow)
  • bow is a very simplified representation of text where only token frequencies are considered
  • advantages: easy to implement, scales well, often good enough
  • disadvantages: discards language properties such as polysemy and word order

Bag of words - example
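As a minimal base R sketch: under bag of words, only token frequencies remain, so the two sentences from the "Text is complex" slide collapse into the same representation:

Code
```r
# bag of words keeps only token counts; word order is discarded
bow <- function(text) {
  tokens <- unlist(strsplit(text, " "))
  table(tokens) # alphabetically sorted counts
}
bow("make peace not war")
bow("make war not peace") # same counts: the "bags" are identical
```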

First dataset for today

  • we will be using a sample from a Kaggle Data Science for Good challenge
  • DonorsChoose.org provided the data and hosts an online platform where teachers can post requests for resources and people can make donations to these projects
  • the goal of the original challenge was to match previous donors with campaigns that would most likely inspire additional donations
  • the dataset includes request texts and context information, such as sex of teachers and school locations.

What could we learn from this data?

Examples of questions we might ask:

  • how has classroom technology use changed over time? How does it differ by geographic location and the age of students?
  • how do the requests of schools in urban areas compare to those in rural areas?
  • what predicts whether a project will be funded?
  • how do the predictors of funding success vary by geographic location? Or by economic status of the students?

Loading packages and data

Code
library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(lubridate)
df <- read_csv('data/donors_choose_sample.csv')
df |> pull(project_title) |>  head(3)
[1] "Books for Brains"                                                
[2] "Softball Teams Need Gloves and Ball Part 3"                      
[3] "Reading and Math Games for Sneaky Learning (They'll Never Know!)"

Text example for one donation request

Code
cat(df$project_essay[1])
A typical day in our room consists of a lot of questions coming from interested minds. Our students' brains are like sponges at this age, soaking up every bit of information they can. However some times there are underlying factors that students struggle with on a daily basis that clouds their learning. 

We have really amazing students at our school and I couldn't be more proud of their thirst for learning!

Our school is located in intercity urban San Antonio. Due to the low socioeconomic status of our school's community, our students often come in with more on their mind than learning fractions. Domestic violence, homelessness, and hunger are many things that our students face when they go home. 

At our school, we take pride in our students and truly care about how each student is feeling. We go out of our way to provide different services such as: counselors, snacks, plenty of hugs, etc., to handle each situation so that our students' can come right back and perform at their best. <!--DONOTREMOVEESSAYDIVIDER-->I am hoping to re-incorporate rich literature into my students daily school life. The students at my school are so plugged in, consumed by their devices, they sometimes miss out on broadening their imagination. 

We live in a world being consumed by technology. 

The books I am requesting coincide with a genre study, a study on a fantastic children's author, and so much more. I want to fill my library with high interest books to put in student hands alongside their technological devices. They will have the best of both worlds! I have a passion for children's literature that I hope to inspire my students with!

Preparing texts

We use a regular expression to clean up the donation texts:

Code
# remove noise
df$project_essay <- str_replace_all(df$project_essay, 
        pattern = '<!--DONOTREMOVEESSAYDIVIDER-->', 
        replacement = '\n\n') 
# validate
str_detect(df$project_essay[1], 
           '<!--DONOTREMOVEESSAYDIVIDER-->')
[1] FALSE

Text analysis using quanteda

  • a variety of R packages support quantitative text analyses. We will focus on quanteda, which is created and maintained by the social scientists behind the Quanteda Initiative
  • other packages that might be interesting for you: tidytext, text2vec

Quanteda corpus object

You can create a quanteda corpus from (1) a character vector or (2) a data frame, which automatically includes meta data as document variables:

Code
donor_corp <- corpus(df, text_field = 'project_essay', 
                 docid_field = 'project_id')
docvars(donor_corp)$text <- df$project_essay # store unprocessed text
ndoc(donor_corp) # no. of documents
[1] 10000

Tokenization

Tokens can be created from a corpus or character vector. The documentation (?tokens) illustrates several options, e.g. for the removal of punctuation.

Code
donor_tokens <- tokens(donor_corp, 
                       remove_numbers = TRUE) # removing digits
donor_tokens[[1]][1:20] # text 1, first 20 tokens
 [1] "A"          "typical"    "day"        "in"         "our"       
 [6] "room"       "consists"   "of"         "a"          "lot"       
[11] "of"         "questions"  "coming"     "from"       "interested"
[16] "minds"      "."          "Our"        "students"   "'"         

Keywords in context (KWIC)

Corpus objects can be used to discover keywords in context (KWIC):

Code
kwic_donor <- kwic(donor_tokens, pattern = c("ipad"),
                      window = 5) # context window
head(kwic_donor, 3)
Keyword-in-context with 3 matches.
  [54bea65a2cadc0f79f367fb1b76d6cfc, 15]      is very important. The | iPad | will also help my students
  [54bea65a2cadc0f79f367fb1b76d6cfc, 29] use advanced technology. An | iPad | would enhance my students'
 [54bea65a2cadc0f79f367fb1b76d6cfc, 188]      computer, much less an | iPad | . The use of an

Basic form of tokens

  • after tokenizing text, some terms with similar semantic meaning might still be treated as different features (e.g. love, loving)
  • one solution is the application of stemming, which tries to reduce words to their basic form:
Code
words <- c("love", "loving", "lovingly", 
           "loved", "lover", "lovely")
char_wordstem(words, 'english')
[1] "love"  "love"  "love"  "love"  "lover" "love" 

To stem or not to stem?

  • whether stemming generates useful features or not varies by use case
  • in the context of topic modeling, some studies suggest that stemmers produce no meaningful improvement (for the English language)
  • an alternative is lemmatization, available via packages like spacyr and udpipe

Stopwords

Multiple preprocessing steps can be chained via the pipe operator, e.g. normalizing to lowercase and removing common English stopwords:

Code
donor_tokens <- donor_tokens |> 
  tokens_tolower() |> 
  tokens_remove(stopwords('english'), 
                padding = TRUE) # keep empty strings

donor_tokens[[1]][1:10]
 [1] ""         "typical"  "day"      ""         ""         "room"    
 [7] "consists" ""         ""         "lot"     

Detecting collocations (phrases)

  • collocations are sequences of tokens which carry a shared semantic meaning, e.g. United States
  • Quanteda can detect collocations with log-linear models. An important parameter is the minimum collocation frequency, which can be used to fine-tune results (see also tokens_select())
Code
colls <- textstat_collocations(donor_tokens,
         min_count = 200) # minimum frequency
donor_tokens_c <- tokens_compound(donor_tokens, colls) |> 
                  tokens_remove('') # remove empty strings
donor_tokens_c[[1]][1:5] # first five tokens of first text
[1] "typical_day" "room"        "consists"    "lot"         "questions"  

Document-Feature Matrix (DFM)

  • most models for automated text analysis require matrices as an input format
  • a common variant which directly translates to the bag of words format is the document term matrix (in quanteda: document-feature matrix):
doc_id  I  like  hate  currywurst
     1  1     1     0           1
     2  1     0     1           1
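The matrix above can be reproduced with quanteda from two toy documents (a sketch; tolower = FALSE keeps the capital "I" as its own feature):

Code
```r
library(quanteda)
toy <- c("I like currywurst", "I hate currywurst")
toy_dfm <- dfm(tokens(toy), tolower = FALSE)
toy_dfm # features: I, like, currywurst, hate with counts as above
```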

Creating a Document-Feature Matrix (dfm)

  • problem: textual data is high-dimensional -> DFMs can grow to millions of rows & columns -> matrices for large text corpora don’t fit in memory
  • features are not uniformly distributed (see e.g. Zipf’s law), most cells contain zeroes
  • solution: a sparse data format, which does not store zero counts. Quanteda natively implements DFMs as sparse matrices
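The effect can be illustrated with the Matrix package (which quanteda builds on): a mostly-zero count matrix shrinks dramatically in sparse format. Exact sizes vary by system:

Code
```r
library(Matrix)
set.seed(42)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 1000)] <- 1 # only 0.1% non-zero cells
sparse <- Matrix(dense, sparse = TRUE)  # stores non-zero entries only
object.size(dense)  # ~8 MB
object.size(sparse) # a few KB
```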

DFMs in quanteda

Quanteda can create DFMs from tokens and other DFM objects:

Code
dfm_donor <- dfm(donor_tokens_c)
dim(dfm_donor)
[1] 10000 28041

More preprocessing - feature trimming

As an alternative (or complement) to manually defining stopwords, terms occurring in either very few or almost all documents can be removed automatically.

Code
dfm_donor <- dfm_donor |> 
  dfm_keep(min_nchar = 2) |> # remove terms < 2 characters
  dfm_trim(min_docfreq = 0.001,  # 0.1% min
           max_docfreq = 0.50,  # 50% max
  docfreq_type = 'prop') # proportions instead of counts
dim(dfm_donor)
[1] 10000  6826

Inspecting most frequent terms

Code
textplot_wordcloud(dfm_donor, max_words = 100, color = 'black')

Inspecting most frequent terms

Code
textstat_frequency(dfm_donor) |> head(10)
      feature frequency rank docfreq group
1     reading      8276    1    3453   all
2        many      7938    2    4918   all
3        able      7933    3    4703   all
4         use      7804    4    4656   all
5       class      7299    5    4333   all
6        work      7266    6    4282   all
7        need      7178    7    4487   all
8       books      6916    8    2379   all
9  technology      5979    9    2578   all
10       love      5921   10    3773   all

Other preprocessing procedures - ngrams

Features can be created from sequences of n tokens.

Code
text <- "to be or not to be"
tokens(text) |> tokens_ngrams(1:2) # unigrams + bigrams
Tokens consisting of 1 document.
text1 :
 [1] "to"     "be"     "or"     "not"    "to"     "be"     "to_be"  "be_or" 
 [9] "or_not" "not_to" "to_be" 
Code
tokens(text) |> tokens_ngrams(3) # trigrams only
Tokens consisting of 1 document.
text1 :
[1] "to_be_or"  "be_or_not" "or_not_to" "not_to_be"

Other preprocessing steps - tf-idf

see dfm_tfidf() for the quanteda implementation
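A toy example of the weighting (a sketch): terms occurring in every document receive an idf of zero, while document-specific terms keep positive weights.

Code
```r
library(quanteda)
toy_dfm <- dfm(tokens(c("cats purr", "cats meow")))
dfm_tfidf(toy_dfm)
# 'cats' occurs in both documents -> idf = log10(2/2) = 0 -> weight 0
# 'purr' and 'meow' occur in one document each -> positive weights
```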

Other pre-processing steps we won’t cover today

Questions?